1.
Comput Methods Programs Biomed ; 250: 108166, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38614026

ABSTRACT

BACKGROUND AND OBJECTIVE: Critically ill children may suffer from impaired neurocognitive functions years after ICU (intensive care unit) discharge. To assess neurocognitive functions, these children are subjected to a fixed sequence of tests. Undergoing all tests is, however, arduous for former pediatric ICU patients, resulting in interrupted evaluations where several neurocognitive deficiencies remain undetected. As a solution, we propose using machine learning to predict the optimal order of tests for each child, reducing the number of tests required to identify the most severe neurocognitive deficiencies. METHODS: We compared the current clinical approach against several machine learning methods, mainly multi-target regression and label ranking methods. We also proposed a new method that builds several multi-target predictive models and combines their outputs into a ranking that prioritizes the worst neurocognitive outcomes. We used data available at discharge from children who participated in the PEPaNIC-RCT trial (ClinicalTrials.gov-NCT01536275), as well as data from a 2-year follow-up study. The institutional review boards at each participating site also approved this follow-up study (ML8052; NL49708.078; Pro00038098). RESULTS: Our proposed method outperformed both the other machine learning methods and current clinical practice. Specifically, our method reaches approximately 80% precision when considering the top-4 outcomes, compared to 65% for current clinical practice and 78% for the state-of-the-art label ranking method. CONCLUSIONS: Our experiments demonstrated that machine learning can be competitive with, or even superior to, the current testing order employed in clinical practice, suggesting that our model can be used to substantially reduce the number of tests necessary for each child. Moreover, the results indicate that possible long-term adverse outcomes are already predictable as early as ICU discharge. Thus, our work can be seen as a first step toward more personalized follow-up after ICU discharge, leading to preventive rather than curative care.
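The core idea of combining several multi-target models into a worst-first test ordering can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the outcome names and per-model severity scores are invented, and the combination rule here is a plain average.

```python
# Hypothetical sketch: combine the outputs of several multi-target models
# into a single test ordering that prioritizes the worst predicted outcomes.
# Outcome names and model scores are invented for illustration.

def rank_tests(predictions_per_model):
    """Average predicted deficiency scores across models (higher = worse)
    and return outcomes sorted worst-first."""
    outcomes = predictions_per_model[0].keys()
    avg = {o: sum(m[o] for m in predictions_per_model) / len(predictions_per_model)
           for o in outcomes}
    return sorted(avg, key=avg.get, reverse=True)

# Three hypothetical multi-target models scoring four neurocognitive outcomes
models = [
    {"memory": 0.9, "attention": 0.4, "motor": 0.2, "language": 0.7},
    {"memory": 0.8, "attention": 0.5, "motor": 0.1, "language": 0.6},
    {"memory": 0.7, "attention": 0.3, "motor": 0.3, "language": 0.8},
]
order = rank_tests(models)  # worst predicted outcome first
```

Under this toy scoring, the child would be tested for memory first, so an interrupted evaluation still covers the most severe predicted deficiency.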


Subject(s)
Intensive Care Units, Pediatric , Machine Learning , Humans , Child , Male , Female , Child, Preschool , Critical Illness , Follow-Up Studies , Patient Discharge
2.
Artif Intell Med ; 150: 102817, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38553157

ABSTRACT

Intubation for mechanical ventilation (MV) is one of the most common high-risk procedures performed in Intensive Care Units (ICUs). Early prediction of intubation may have a positive impact by providing timely alerts to clinicians and consequently avoiding high-risk late intubations. In this work, we propose a new machine learning method to predict the time to intubation during the first five days of ICU admission, based on the concept of cure survival models. Our approach combines classification and survival analysis to effectively accommodate the fraction of patients not at risk of intubation, and to provide a better estimate of time to intubation for patients at risk. We tested our approach and compared it to other predictive models on a dataset collected from a secondary care hospital (AZ Groeninge, Kortrijk, Belgium) from 2015 to 2021, consisting of 3425 ICU stays. Furthermore, we utilised SHAP for feature importance analysis, extracting key insights into the relative significance of variables such as vital signs, blood gases, and patient characteristics in predicting intubation in ICU settings. The results corroborate that our approach improves the prediction of time to intubation in critically ill patients, using routinely collected data from the first hours of ICU admission. Such early warnings may help clinicians assess the risk of intubation and rank patients according to their expected time to intubation.
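The cure-model idea above splits the prediction into two parts: a classifier for whether the patient is at risk at all, and a time model that only applies to the at-risk fraction. A minimal sketch, with both components replaced by invented toy functions (a logistic score and a toy time formula over a made-up `spo2_drop` feature):

```python
import math

def cure_model_predict(x, cure_classifier, time_model):
    """Mixture-cure idea: p = P(patient is at risk of intubation);
    the time-to-event model applies only to the at-risk fraction.
    Returns (risk probability, expected time if at risk)."""
    p_risk = cure_classifier(x)
    t_if_risk = time_model(x)
    return p_risk, t_if_risk

# Toy stand-ins for the two components (NOT the paper's fitted models):
# a logistic score and an exponential-style mean time, both invented.
cure_clf = lambda x: 1 / (1 + math.exp(-(0.8 * x["spo2_drop"] - 1.0)))
time_mod = lambda x: 48.0 / (1.0 + x["spo2_drop"])  # hours, toy formula

p, t = cure_model_predict({"spo2_drop": 2.5}, cure_clf, time_mod)
```

A patient with a large oxygenation drop gets a high risk probability and a short expected time to intubation, which is exactly the quantity used for ranking patients.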


Subject(s)
Critical Care , Hospitalization , Humans , Intensive Care Units , Intubation , Machine Learning , Critical Illness , Retrospective Studies
3.
Sci Rep ; 13(1): 9864, 2023 06 18.
Article in English | MEDLINE | ID: mdl-37331979

ABSTRACT

Acute Kidney Injury (AKI) is a sudden episode of kidney failure that is frequently seen in critically ill patients. AKI has been linked to chronic kidney disease (CKD) and mortality. We developed machine learning-based prediction models to predict outcomes following AKI stage 3 events in the intensive care unit. We conducted a prospective observational study that used the medical records of ICU patients diagnosed with AKI stage 3. A random forest algorithm was used to develop two models that predict which patients will progress to CKD three and six months after experiencing AKI stage 3. To predict mortality, two survival prediction models were developed using random survival forests and survival XGBoost. We evaluated the CKD prediction models using AUCROC and AUPR curves and compared them with baseline logistic regression models. The mortality prediction models were evaluated with an external test set, and their C-indices were compared to a baseline Cox proportional hazards (CoxPH) model. We included 101 critically ill patients who experienced AKI stage 3. To increase the training set for the mortality prediction task, an unlabeled dataset was added. The RF (AUPR: 0.895 and 0.848) and XGBoost (c-index: 0.8248) models performed better than the baseline models in predicting CKD and mortality, respectively. Machine learning-based models can assist clinicians in making clinical decisions regarding critically ill patients with severe AKI who are likely to develop CKD following discharge. Additionally, we have shown better performance when unlabeled data are incorporated into the survival analysis task.
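The c-index reported for the survival models measures how often the model ranks the higher-risk of two comparable patients correctly. A minimal, self-contained implementation of Harrell's concordance index on toy data (times, event flags, and risk scores are invented):

```python
def c_index(times, events, scores):
    """Harrell's concordance index: the fraction of comparable patient
    pairs where the higher risk score goes with the earlier event time.
    A pair (i, j) is comparable if i had an observed event before j's
    observed time; ties in score count as half-concordant."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy cohort: follow-up time (months), event indicator (1 = died,
# 0 = censored), and a model's predicted risk score.
times = [2, 5, 7, 9]
events = [1, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.5]
c = c_index(times, events, scores)  # 3 of 5 comparable pairs concordant
```

A c-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.8248 should be read.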


Subject(s)
Acute Kidney Injury , Renal Insufficiency, Chronic , Humans , Critical Illness , Prospective Studies , Acute Kidney Injury/diagnosis , Machine Learning
4.
BMC Nephrol ; 24(1): 133, 2023 05 09.
Article in English | MEDLINE | ID: mdl-37161365

ABSTRACT

BACKGROUND: Acute Kidney Injury (AKI) is frequently seen in hospitalized and critically ill patients. Studies have shown that AKI is a risk factor for the development of acute kidney disease (AKD), chronic kidney disease (CKD), and mortality. METHODS: A systematic review was performed on validated risk prediction models for developing poor renal outcomes after AKI scenarios. Medline, EMBASE, Cochrane, and Web of Science were searched for articles that developed or validated a prediction model. Moreover, studies that report prediction models for recovery after AKI were also included. This review was registered with PROSPERO (CRD42022303197). RESULTS: We screened 25,812 potentially relevant abstracts. Among the 149 remaining articles in the first selection, eight met the inclusion criteria. All of the included studies developed more than one prediction model with different variables. The models included between 3 and 28 independent variables, and c-statistics ranged from 0.55 to 1. CONCLUSION: Few validated risk prediction models targeting the development of renal insufficiency after experiencing AKI have been developed, most of which are based on simple statistical or machine learning models. While some of these models have been externally validated, none are available in a form that can be used or evaluated in a clinical setting.


Subject(s)
Acute Kidney Injury , Renal Insufficiency, Chronic , Humans , Acute Kidney Injury/diagnosis , Kidney , Machine Learning , Risk Factors
5.
Comput Biol Med ; 152: 106423, 2023 01.
Article in English | MEDLINE | ID: mdl-36529023

ABSTRACT

With the development of new sequencing technologies, the availability of genomic data has grown exponentially. Over the past decade, numerous studies have used genomic data to identify associations between genes and biological functions. While these studies have shown success in annotating genes with functions, they often assume that genes are completely annotated and fail to take into account that datasets are sparse and noisy. This work proposes a method to detect missing annotations in the context of hierarchical multi-label classification. More precisely, our method exploits the relations between functions, represented as a hierarchy, by computing probabilities based on the paths of functions in the hierarchy. By performing several experiments on a variety of rice (Oryza sativa Japonica), we show that the proposed method accurately detects missing annotations and yields superior results compared to state-of-the-art methods from the literature.
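The path-based idea can be sketched as scoring a candidate missing annotation by aggregating a model's predicted probabilities along the label's path to the hierarchy root. This is a hedged illustration only: the tiny hierarchy, the label names, and the aggregation rule (a plain average) are all invented, not the paper's exact computation.

```python
# Toy label hierarchy, child -> parent (invented labels).
hierarchy = {
    "GO:b": "GO:root", "GO:c": "GO:b", "GO:d": "GO:c",
}

def path_to_root(label):
    """Walk parent links until the root is reached."""
    path = [label]
    while path[-1] in hierarchy:
        path.append(hierarchy[path[-1]])
    return path

def missing_annotation_score(label, predicted_prob):
    """Score a candidate missing annotation by averaging the predicted
    probabilities along its path (root excluded). High average over the
    whole path suggests the gene should carry this annotation."""
    path = [l for l in path_to_root(label) if l != "GO:root"]
    return sum(predicted_prob[l] for l in path) / len(path)

# Invented per-label probabilities for one gene.
probs = {"GO:b": 0.9, "GO:c": 0.8, "GO:d": 0.7}
score = missing_annotation_score("GO:d", probs)
```

Because the score depends on the whole path, a deep label is only flagged as missing when its ancestors are also predicted with confidence, which matches the hierarchical consistency constraint of HMC.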


Subject(s)
Genomics , Gene Ontology , Molecular Sequence Annotation , Probability
6.
Behav Res Methods ; 55(4): 2109-2124, 2023 Jun.
Article in English | MEDLINE | ID: mdl-35819719

ABSTRACT

To obtain more accurate and robust feedback information from the students' assessment outcomes and to communicate it to students and optimize teaching and learning strategies, educational researchers and practitioners must critically reflect on whether the existing methods of data analytics are capable of retrieving the information provided in the database. This study compared and contrasted the prediction performance of an item response theory method, particularly the use of an explanatory item response model (EIRM), and six supervised machine learning (ML) methods for predicting students' item responses in educational assessments, considering student- and item-related background information. Each of seven prediction methods was evaluated through cross-validation approaches under three prediction scenarios: (a) unrealized responses of new students to existing items, (b) unrealized responses of existing students to new items, and (c) missing responses of existing students to existing items. The results of a simulation study and two real-life assessment data examples showed that employing student- and item-related background information in addition to the item response data substantially increases the prediction accuracy for new students or items. We also found that the EIRM is as competitive as the best performing ML methods in predicting the student performance outcomes for the educational assessment datasets.
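The three prediction scenarios above differ only in how (student, item) response cells are held out. A minimal sketch of the three splitting schemes, with integer IDs standing in for real students and items:

```python
import random

def cv_splits(students, items, scenario, seed=0):
    """Return (train, test) lists of (student, item) cells for the three
    scenarios: 'new_students' holds out whole rows, 'new_items' whole
    columns, 'missing_responses' random cells. Illustrative only."""
    rng = random.Random(seed)
    cells = [(s, i) for s in students for i in items]
    if scenario == "new_students":
        test_students = set(rng.sample(students, k=len(students) // 5))
        test = [c for c in cells if c[0] in test_students]
    elif scenario == "new_items":
        test_items = set(rng.sample(items, k=len(items) // 5))
        test = [c for c in cells if c[1] in test_items]
    else:  # "missing_responses": hold out random individual cells
        test = rng.sample(cells, k=len(cells) // 5)
    test_set = set(test)
    train = [c for c in cells if c not in test_set]
    return train, test

# 10 hypothetical students answering 5 items.
train, test = cv_splits(list(range(10)), list(range(5)), "new_students")
```

For unseen students or items, no response from the held-out row or column appears in training, which is why background covariates (the EIRM's "explanatory" part) carry most of the predictive signal in those two scenarios.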


Subject(s)
Educational Measurement , Students , Humans , Computer Simulation , Educational Status , Machine Learning
7.
IEEE Trans Neural Netw Learn Syst ; 34(10): 6755-6767, 2023 10.
Article in English | MEDLINE | ID: mdl-36269923

ABSTRACT

Data production has grown rapidly in recent years, to the point that traditional or batch machine-learning (ML) algorithms cannot cope with the sheer volume of generated data. Stream or online ML presents itself as a viable solution to deal with the dynamic nature of streaming data. Besides coping with the inherent challenges of streaming data, online ML solutions must be accurate, fast, and have a reduced memory footprint. We propose a new decision tree-based ensemble algorithm for online ML regression named online extra trees (OXT). Our proposal takes inspiration from the batch learning extra trees (XT) algorithm, a popular and faster alternative to random forest (RF). While speed and memory costs might not be a central concern in most batch applications, they become crucial in data stream learning. Our proposal combines subbagging (sampling without replacement), random tree split points, and model trees to deliver competitive prediction errors and reduced computational costs. Throughout an extensive experimental evaluation comprising 22 real-world and synthetic datasets, we compare OXT against the state-of-the-art adaptive RF (ARF) and other incremental regressors. OXT is generally more accurate than its competitors while running significantly faster than ARF and expending significantly less memory.
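Subbagging, the first ingredient listed above, differs from the bootstrap used in bagging by drawing each ensemble member's training subset without replacement. A minimal sketch (the fraction and sizes are arbitrary choices for illustration):

```python
import random

def subbag(n_samples, fraction, rng):
    """Subbagging: draw a random subset of sample indices WITHOUT
    replacement, so no index is duplicated (contrast with bootstrap,
    which samples with replacement and repeats ~some indices)."""
    k = max(1, int(fraction * n_samples))
    return rng.sample(range(n_samples), k)

rng = random.Random(42)
indices = subbag(1000, 0.25, rng)  # one ensemble member's subset
```

Because each member sees only a fraction of the stream and stores no duplicated examples, subbagging reduces both per-tree training cost and memory, which is the point of using it in the online setting.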


Subject(s)
Algorithms , Neural Networks, Computer , Machine Learning , Random Forest
8.
J Clin Med ; 11(24)2022 Dec 07.
Article in English | MEDLINE | ID: mdl-36555881

ABSTRACT

Background: Acute kidney injury (AKI) in critically ill patients is associated with a significant increase in mortality as well as long-term renal dysfunction and chronic kidney disease (CKD). Serum creatinine (SCr), the most widely used biomarker to evaluate kidney function, does not always accurately predict the glomerular filtration rate (GFR), since it is affected by some non-GFR determinants such as muscle mass and recent meat ingestion. Researchers and clinicians have gained interest in cystatin C (CysC), another biomarker of kidney function. The study objective was to compare GFR estimation using SCr and CysC in detecting CKD over a 1-year follow-up after an AKI stage-3 event in the ICU, as well as to analyze the association between eGFR (using SCr and CysC) and mortality after the AKI event. Method: This prospective observational study used the medical records of ICU patients diagnosed with AKI stage 3. SCr and CysC were measured twice during the ICU stay and four times following diagnosis of AKI. The eGFR was calculated using the EKFC equation for SCr and the FAS equation for CysC in order to check the prevalence of CKD (defined as eGFR < 60 mL/min/1.73 m2). Results: The study enrolled 101 patients, 36.6% of whom were female, with a median age of 74 years (30−92) and a median length of stay of 14.5 days in intensive care. A significant difference was observed in the estimation of GFR when comparing formulas based on SCr and CysC, resulting in large differences in the prediction of CKD. Three months after the AKI event, eGFRCysC < 25 mL/min/1.73 m2 was a predictive factor of later mortality; however, this was not the case for eGFRSCr. Conclusion: The incidence of CKD was highly discrepant with eGFRCysC versus eGFRSCr during the follow-up period. CysC detects more CKD events than SCr during follow-up, and eGFRCysC, but not eGFRSCr, is a predictor of mortality during follow-up. Determining the proper marker to estimate GFR in the post-ICU period in AKI stage-3 populations needs further study to improve risk stratification.
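The FAS (full age spectrum) equation mentioned above normalizes the biomarker by its healthy-population median Q. A hedged sketch of the cystatin C form, flagging CKD at the study's eGFR < 60 threshold; the Q values and the age-decline factor used here are assumptions drawn from the FAS literature and should be verified against the original FAS publication before any use.

```python
def egfr_fas_cysc(cysc_mg_l, age_years):
    """Hedged sketch of FAS eGFR from cystatin C (mL/min/1.73 m^2).
    Q normalizes CysC to a healthy-population median; the Q values and
    the 0.988^(age-40) decline factor are assumptions to be verified."""
    q = 0.82 if age_years < 70 else 0.95   # assumed FAS Q for CysC
    egfr = 107.3 / (cysc_mg_l / q)
    if age_years > 40:
        egfr *= 0.988 ** (age_years - 40)  # assumed age decline
    return egfr

def is_ckd(egfr):
    # CKD threshold used in the study: eGFR < 60 mL/min/1.73 m^2
    return egfr < 60.0

# A hypothetical 74-year-old with CysC of 1.8 mg/L (invented values).
e = egfr_fas_cysc(1.8, 74)
```

Because SCr and CysC are shifted by different non-GFR determinants, the same patient can fall on opposite sides of the 60 mL/min/1.73 m2 cutoff depending on the marker, which is the discrepancy the study quantifies.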

9.
J Heart Lung Transplant ; 41(7): 928-936, 2022 07.
Article in English | MEDLINE | ID: mdl-35568604

ABSTRACT

BACKGROUND: Outcome prediction following heart transplant is critical to explaining risks and benefits to patients and to decision-making when considering potential organ offers. Given the large number of potential variables to be considered, this task may be most efficiently performed using machine learning (ML). We trained and tested ML and statistical algorithms to predict outcomes following cardiac transplant using the United Network for Organ Sharing (UNOS) database. METHODS: We included 59,590 adult and 8,349 pediatric patients enrolled in the UNOS database between January 1994 and December 2016 who underwent cardiac transplantation. We evaluated 3 classification and 3 survival methods. Algorithms were evaluated using shuffled 10-fold cross-validation (CV) and rolling CV. Predictive performance for 1-year and 90-day all-cause mortality was characterized using the area under the receiver-operating characteristic curve (AUC) with 95% confidence intervals. RESULTS: In total, 8,394 (12.4%) patients died within 1 year of transplant. For predicting 1-year survival using the shuffled 10-fold CV, Random Forest achieved the highest AUC (0.893; 0.889-0.897), followed by XGBoost and logistic regression. In the rolling CV, prediction performance was more modest and comparable among the models, with XGBoost and logistic regression achieving the highest AUCs of 0.657 (0.647-0.667) and 0.641 (0.631-0.651), respectively. There was a trend toward higher prediction performance in pediatric patients. CONCLUSIONS: Our study suggests that ML and statistical models can be used to predict mortality post-transplant, but based on the results from rolling CV, the overall prediction performance will be limited by temporal shifts in patient and donor selection.
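The gap between shuffled and rolling CV comes from the split direction: rolling CV always trains on earlier years and tests on later ones, so temporal drift is never hidden by shuffling. A minimal sketch of a rolling split over (year, record) pairs (the toy data is invented):

```python
def rolling_cv(records, n_splits):
    """Rolling (temporal) cross-validation over (year, id) records:
    each fold trains on all records up to a cutoff and tests on the
    next block; records are never shuffled across time."""
    records = sorted(records)
    fold = len(records) // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = records[: k * fold]
        test = records[k * fold : (k + 1) * fold]
        yield train, test

# Toy registry: 80 records, 10 per year, 1994-2001 (invented).
data = [(1994 + i // 10, i) for i in range(80)]
splits = list(rolling_cv(data, 3))
```

Shuffled k-fold would mix 2016 patients into the training set when testing on 1994 patients, optimistically inflating the AUC; rolling CV exposes the performance actually attainable when predicting forward in time.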


Subject(s)
Heart Transplantation , Machine Learning , Adult , Algorithms , Child , Databases, Factual , Humans , ROC Curve
10.
Comput Biol Med ; 141: 105001, 2022 02.
Article in English | MEDLINE | ID: mdl-34782112

ABSTRACT

Many clinical studies follow patients over time and record the time until the occurrence of an event of interest (e.g., recovery, death, …). When patients drop out of the study or when their event did not happen before the study ended, the collected dataset is said to contain censored observations. Given the rise of personalized medicine, clinicians are often interested in accurate risk prediction models that predict, for unseen patients, a survival profile, including the expected time until the event. Survival analysis methods are used to detect associations or compare subpopulations of patients in this context. In this article, we propose to cast the time-to-event prediction task as a multi-target regression task, with censored observations modeled as partially labeled examples. We then apply semi-supervised learning to the resulting data representation. More specifically, we use semi-supervised predictive clustering trees and ensembles thereof. Empirical results over eleven real-life datasets demonstrate superior or equivalent predictive performance of the proposed approach as compared to three competitor methods. Moreover, smaller models are obtained compared to random survival forests, another tree ensemble method. Finally, we illustrate the informative feature selection mechanism of our method, by interpreting the splits induced by a single tree model when predicting survival for amyotrophic lateral sclerosis patients.
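The key data transformation above, casting time-to-event prediction as multi-target regression with censored patients as partially labeled examples, can be sketched concretely. The horizons below are arbitrary illustrative choices, not the paper's:

```python
def to_multi_target(time, event, horizons):
    """Encode one patient as binary targets 'event occurred by horizon h'.
    For a censored patient, horizons beyond the censoring time are
    unknown -> None, giving a partially labeled (semi-supervised) example."""
    targets = []
    for h in horizons:
        if time > h:
            targets.append(0)      # known event-free at horizon h
        elif event:
            targets.append(1)      # event observed by horizon h
        else:
            targets.append(None)   # censored before h: label unknown
    return targets

horizons = [30, 90, 365]                      # days; illustrative choice
died_d100 = to_multi_target(100, 1, horizons)  # event at day 100
cens_d100 = to_multi_target(100, 0, horizons)  # censored at day 100
```

The `None` entries are exactly where semi-supervised predictive clustering trees can exploit the input-space structure of censored patients instead of discarding them.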


Subject(s)
Supervised Machine Learning , Cluster Analysis , Humans , Multivariate Analysis , Survival Analysis
11.
Eur Heart J Digit Health ; 2(3): 390-400, 2021 Sep.
Article in English | MEDLINE | ID: mdl-36713600

ABSTRACT

Aims: There is a need for better phenotypic characterization of the asymptomatic stages of cardiac maladaptation. We tested the hypothesis that an unsupervised clustering analysis utilizing echocardiographic indexes reflecting left heart structure and function could identify phenotypically distinct groups of asymptomatic individuals in the general population. Methods and results: We prospectively studied 1407 community-dwelling individuals (mean age, 51.2 years; 51.1% women), in whom we performed clinical and echocardiographic examination at baseline and collected cardiac events on average 8.8 years later. Cardiac phenotypes that were correlated at r > 0.8 were filtered, leaving 21 echocardiographic features, and systolic blood pressure for phenogrouping. We employed hierarchical and Gaussian mixture model-based clustering. Cox regression was used to demonstrate the clinical validity of constructed phenogroups. Unsupervised clustering analyses classified study participants into three distinct phenogroups that differed markedly in echocardiographic indexes. Indeed, cluster 3 had the worst left ventricular (LV) diastolic function (i.e. lowest e' velocity and left atrial (LA) reservoir strain, highest E/e', and LA volume index) and LV remodelling. The phenogroups were also different in cardiovascular risk factor profiles. We observed increase in the risk for incidence of adverse events across phenogroups. In the third phenogroup, the multivariable adjusted risk was significantly higher than the average population risk for major cardiovascular events (51%, P = 0.0028). Conclusion: Unsupervised learning algorithms integrating routinely measured cardiac imaging and haemodynamic data can provide a clinically meaningful classification of cardiac health in asymptomatic individuals. This approach might facilitate early detection of cardiac maladaptation and improve risk stratification.
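The pre-processing step above, filtering phenotypes correlated at r > 0.8 before clustering, can be sketched as a greedy correlation filter. The feature names and pairwise correlations below are invented for illustration; the study's actual 21 retained features differ.

```python
def correlation_filter(features, corr, threshold=0.8):
    """Greedy filter: keep a feature only if its absolute correlation
    with every already-kept feature is <= threshold (mirrors the
    r > 0.8 pre-filter used before phenogrouping)."""
    kept = []
    for f in features:
        if all(abs(corr[(f, k)]) <= threshold for k in kept):
            kept.append(f)
    return kept

# Invented echocardiographic features and pairwise correlations.
features = ["e_prime", "E_over_e", "LA_volume", "LV_mass"]
corr = {
    ("E_over_e", "e_prime"): 0.85,   # redundant with e' -> dropped
    ("LA_volume", "e_prime"): 0.30,
    ("LV_mass", "e_prime"): 0.20,
    ("LV_mass", "LA_volume"): 0.40,
}
kept = correlation_filter(features, corr)
```

Removing near-duplicate indexes before clustering keeps any single aspect of diastolic function from dominating the distance computations that define the phenogroups.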

12.
IEEE/ACM Trans Comput Biol Bioinform ; 18(4): 1596-1607, 2021.
Article in English | MEDLINE | ID: mdl-31689203

ABSTRACT

Identifying drug-target interactions is crucial for drug discovery. Despite modern technologies used in drug screening, experimental identification of drug-target interactions is an extremely demanding task. Predicting drug-target interactions in silico can thereby facilitate drug discovery as well as drug repositioning. Various machine learning models have been developed over the years to predict such interactions. Multi-output learning models in particular have drawn the attention of the scientific community due to their high predictive performance and computational efficiency. These models are based on the assumption that all the labels are correlated with each other. However, this assumption is too optimistic. Here, we address drug-target interaction prediction as a multi-label classification task that is combined with label partitioning. We show that building multi-output learning models over groups (clusters) of labels often leads to superior results. The performed experiments confirm the efficiency of the proposed framework.
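Label partitioning, as described above, groups the target labels and trains one multi-output model per group. A minimal sketch of the partitioning step only; the grouping rule here (round-robin over prevalence ranks) is an invented stand-in for the clustering the paper actually uses.

```python
def partition_labels(label_matrix, n_groups):
    """Toy label partitioning: rank label columns by prevalence and
    assign them to groups round-robin. In the paper, labels are grouped
    by clustering; this simple rule is a stand-in for illustration."""
    n_labels = len(label_matrix[0])
    prevalence = [(sum(row[j] for row in label_matrix), j)
                  for j in range(n_labels)]
    order = [j for _, j in sorted(prevalence, reverse=True)]
    groups = [[] for _ in range(n_groups)]
    for rank, j in enumerate(order):
        groups[rank % n_groups].append(j)
    return groups

# Rows: drugs; columns: targets; 1 = known interaction (invented matrix).
Y = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 0]]
groups = partition_labels(Y, 2)
```

A separate multi-output classifier would then be trained on each column group, so label correlations are modeled within groups without assuming all labels are mutually correlated.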


Subject(s)
Computational Biology/methods , Drug Development/methods , Drug Discovery/methods , Machine Learning
13.
Eur Heart J Cardiovasc Imaging ; 22(10): 1208-1217, 2021 09 20.
Article in English | MEDLINE | ID: mdl-32588036

ABSTRACT

AIMS: Both left ventricular (LV) diastolic dysfunction (LVDD) and hypertrophy (LVH) as assessed by echocardiography are independent prognostic markers of future cardiovascular events in the community. However, selective screening strategies to identify individuals at risk who would benefit most from cardiac phenotyping are lacking. We, therefore, assessed the utility of several machine learning (ML) classifiers built on routinely measured clinical, biochemical, and electrocardiographic features for detecting subclinical LV abnormalities. METHODS AND RESULTS: We included 1407 participants (mean age, 51 years, 51% women) randomly recruited from the general population. We used echocardiographic parameters reflecting LV diastolic function and structure to define LV abnormalities (LVDD, n = 252; LVH, n = 272). Next, five supervised ML algorithms (XGBoost, AdaBoost, Random Forest (RF), Support Vector Machines, and Logistic regression) were used to build classifiers based on clinical data (67 features) to categorize LVDD and LVH. We applied a nested 10-fold cross-validation set-up. XGBoost and RF classifiers exhibited a high area under the receiver operating characteristic curve, with values between 86.2% and 88.1% for predicting LVDD and between 77.7% and 78.5% for predicting LVH. Age, body mass index, different components of blood pressure, history of hypertension, antihypertensive treatment, and various electrocardiographic variables were the top selected features for predicting LVDD and LVH. CONCLUSION: XGBoost and RF classifiers combining routinely measured clinical, laboratory, and electrocardiographic data predicted LVDD and LVH with high accuracy. These ML classifiers might be useful to pre-select individuals in whom further echocardiographic examination, monitoring, and preventive measures are warranted.
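The nested cross-validation set-up mentioned above uses an inner loop to pick hyper-parameters and an outer loop to score the whole selection procedure. A library-free skeleton of that structure; the data, candidate parameters, and scoring callback are all toy stand-ins:

```python
def nested_cv(data, outer_k, inner_k, candidate_params, fit_score):
    """Nested CV skeleton: the inner loop selects a hyper-parameter on
    inner folds of the outer-training data; the outer loop then scores
    the selected configuration on held-out data it never saw."""
    outer_scores = []
    for outer in range(outer_k):
        test = data[outer::outer_k]
        train = [d for i, d in enumerate(data) if i % outer_k != outer]
        best_param, best = None, float("-inf")
        for p in candidate_params:          # inner model selection
            inner_scores = []
            for inner in range(inner_k):
                val = train[inner::inner_k]
                fit = [d for i, d in enumerate(train) if i % inner_k != inner]
                inner_scores.append(fit_score(fit, val, p))
            mean = sum(inner_scores) / inner_k
            if mean > best:
                best, best_param = mean, p
        outer_scores.append(fit_score(train, test, best_param))
    return sum(outer_scores) / outer_k

# Toy scorer: ignores the data and scores a config by its parameter value,
# so the inner loop should always select 0.5 (purely illustrative).
result = nested_cv(list(range(100)), 5, 3, [0.1, 0.5, 0.3],
                   lambda fit, val, p: p)
```

Scoring hyper-parameter selection inside the loop is what keeps the reported AUCs (86-88% for LVDD) honest: the outer test folds never influence which configuration is chosen.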


Subject(s)
Hypertension , Ventricular Dysfunction, Left , Female , Humans , Hypertrophy, Left Ventricular , Machine Learning , Male , Middle Aged , Risk Factors , Ventricular Dysfunction, Left/diagnostic imaging , Ventricular Remodeling
14.
BMC Bioinformatics ; 21(1): 49, 2020 Feb 07.
Article in English | MEDLINE | ID: mdl-32033537

ABSTRACT

BACKGROUND: Computational prediction of drug-target interactions (DTI) is vital for drug discovery. The experimental identification of interactions between drugs and target proteins is very onerous. Modern technologies have mitigated the problem and supported the development of new drugs. However, drug development remains extremely expensive and time consuming. Therefore, in silico DTI predictions based on machine learning can alleviate the burdensome task of drug development. Many machine learning approaches have been proposed over the years for DTI prediction. Nevertheless, prediction accuracy and efficiency are persisting problems that still need to be tackled. Here, we propose a new learning method which addresses DTI prediction as a multi-output prediction task by learning ensembles of multi-output bi-clustering trees (eBICT) on reconstructed networks. In our setting, the nodes of a DTI network (drugs and proteins) are represented by features (background information). The interactions between the nodes of a DTI network are modeled as an interaction matrix and compose the output space in our problem. The proposed approach integrates background information from both drug and target protein spaces into the same global network framework. RESULTS: We performed an empirical evaluation, comparing the proposed approach to state-of-the-art DTI prediction methods, and demonstrated the effectiveness of the proposed approach in different prediction settings. For evaluation purposes, we used several benchmark datasets that represent drug-protein networks. We show that output space reconstruction can boost the predictive performance of tree-ensemble learning methods, yielding more accurate DTI predictions. CONCLUSIONS: We proposed a new DTI prediction method where bi-clustering trees are built on reconstructed networks. Building tree-ensemble learning models with output space reconstruction leads to superior prediction results, while preserving the advantages of tree-ensembles, such as scalability, interpretability and inductive setting.


Subject(s)
Drug Discovery/methods , Machine Learning , Proteins/drug effects , Cluster Analysis , Computer Simulation , Drug Development
15.
BMC Bioinformatics ; 20(1): 525, 2019 Oct 28.
Article in English | MEDLINE | ID: mdl-31660848

ABSTRACT

BACKGROUND: Network inference is crucial for biomedicine and systems biology. Biological entities and their associations are often modeled as interaction networks. Examples include drug-protein interaction networks and gene regulatory networks. Studying and elucidating such networks can lead to the comprehension of complex biological processes. However, we usually have only partial knowledge of those networks, and the experimental identification of all the existing associations between biological entities is very time consuming and particularly expensive. Many computational approaches have been proposed over the years for network inference; nonetheless, efficiency and accuracy remain open problems. Here, we propose bi-clustering tree ensembles as a new machine learning method for network inference, extending the traditional tree-ensemble models to the global network setting. The proposed approach addresses the network inference problem as a multi-label classification task. More specifically, the nodes of a network (e.g., drugs or proteins in a drug-protein interaction network) are modelled as samples described by features (e.g., chemical structure similarities or protein sequence similarities). The labels in our setting represent the presence or absence of links connecting the nodes of the interaction network (e.g., drug-protein interactions in a drug-protein interaction network). RESULTS: We extended traditional tree-ensemble methods, such as extremely randomized trees (ERT) and random forests (RF), to ensembles of bi-clustering trees, integrating background information from both node sets of a heterogeneous network into the same learning framework. We performed an empirical evaluation, comparing the proposed approach to currently used tree-ensemble based approaches as well as other approaches from the literature. We demonstrated the effectiveness of our approach in different interaction prediction (network inference) settings. For evaluation purposes, we used several benchmark datasets that represent drug-protein and gene regulatory networks. We also applied our proposed method to two versions of a chemical-protein association network extracted from the STITCH database, demonstrating the potential of our model in predicting non-reported interactions. CONCLUSIONS: Bi-clustering trees outperform existing tree-based strategies as well as machine learning methods based on other algorithms. Since our approach is based on tree-ensembles, it inherits the advantages of tree-ensemble learning, such as handling of missing values, scalability and interpretability.


Subject(s)
Cluster Analysis , Algorithms , Databases, Factual , Gene Regulatory Networks , Machine Learning , Protein Interaction Maps , Proteins/metabolism
16.
BMC Bioinformatics ; 20(1): 485, 2019 Sep 23.
Article in English | MEDLINE | ID: mdl-31547800

ABSTRACT

BACKGROUND: A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. RESULTS: The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, whereas HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were fewer significant differences among the methods. CONCLUSIONS: The experiments showed that protein function prediction is a very challenging task that should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.


Subject(s)
Machine Learning , Molecular Sequence Annotation/methods , Proteomics/methods , Cluster Analysis , Eukaryota/metabolism , Gene Ontology , Humans
17.
J Biomed Inform ; 85: 40-48, 2018 09.
Article in English | MEDLINE | ID: mdl-30012356

ABSTRACT

The volume of biomedical data available to the machine learning community grows very rapidly. A natural question is how informative these data really are, or how discriminant the features describing the data instances are. Several biomedical datasets suffer from a lack of variance in the instance representation, or even worse, contain instances with identical features and different class labels. Indisputably, this directly affects the performance of machine learning algorithms, as well as the ability to interpret their results. In this article, we focus on the aforementioned problem and propose a target-informed feature induction method based on tree ensemble learning. The method brings more variance into the data representation, thereby potentially increasing the predictive performance of a learner applied to the induced features. The contribution of this article is twofold. Firstly, a problem affecting the quality of biomedical data is highlighted, and secondly, a method to handle that problem is proposed. The efficiency of the presented approach is validated on multi-target prediction tasks. The obtained results indicate that the proposed approach is able to boost the discrimination between the data instances and increase the predictive performance.
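One common way to induce features from a tree ensemble is to pass each instance down every tree and use the reached leaves as new indicator features. A minimal sketch with a hand-made stump "ensemble"; in the paper the trees are learned from the targets, which is what makes the induction target-informed, whereas the stumps and thresholds below are invented for illustration.

```python
# Invented stump ensemble: (feature index, threshold) pairs; an instance
# goes to the left leaf if its feature value is <= the threshold.
stumps = [
    (0, 0.5), (1, 1.0), (0, 2.0),
]

def induce_features(x):
    """One-hot leaf indicators: two entries per stump (left, right).
    Near-duplicate instances land in different leaf patterns, adding
    variance to the representation."""
    out = []
    for f, thr in stumps:
        left = x[f] <= thr
        out.extend([1 if left else 0, 0 if left else 1])
    return out

phi = induce_features([0.2, 3.0])  # induced representation of one instance
```

Note that instances with strictly identical raw features still map to identical induced features, so the method spreads out near-duplicates rather than resolving true feature collisions with conflicting labels.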


Subject(s)
Cluster Analysis , Data Mining/methods , Decision Trees , Machine Learning , Algorithms , Computational Biology , Databases, Factual/statistics & numerical data , Escherichia coli/genetics , Escherichia coli/metabolism , Gene Regulatory Networks , Humans , Metabolic Networks and Pathways , Protein Interaction Maps , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism
18.
PLoS Comput Biol ; 14(4): e1006097, 2018 04.
Article in English | MEDLINE | ID: mdl-29684010

ABSTRACT

Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework towards LTR retrotransposons, a particular type of TEs characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques, outperforming them in terms of predictive performance while learning models and making predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-Learner's predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.
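The defining property the abstract mentions, near-identical long terminal repeats at both ends of the element, can be illustrated with a toy check that compares the two terminal windows of a candidate sequence. The window size, identity cutoff, and sequences below are arbitrary illustration values, not TE-Learner's actual features or parameters.

```python
# Toy illustration of the LTR property: an LTR retrotransposon carries a
# (near-)identical repeat at both of its boundaries. Window size and
# cutoff are hypothetical, not taken from the paper.

def terminal_identity(seq, window):
    """Fraction of matching bases between the two terminal windows."""
    head, tail = seq[:window], seq[-window:]
    matches = sum(a == b for a, b in zip(head, tail))
    return matches / window

def looks_like_ltr(seq, window=6, cutoff=0.8):
    """Flag sequences whose termini are (near-)identical repeats."""
    return terminal_identity(seq, window) >= cutoff

candidate = "ACGTGAxxxxxxxxxxACGTGA"  # identical 6-base termini
negative = "ACGTGAxxxxxxxxxxTTTTTT"   # unrelated termini
```

A real identifier would of course combine many such signals (repeat length, target-site duplications, internal coding domains) in a learned model rather than a single threshold.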


Subject(s)
Machine Learning , Retroelements , Terminal Repeat Sequences , Animals , Arabidopsis/genetics , Arabidopsis Proteins/genetics , Computational Biology , Conserved Sequence , DNA, Plant/genetics , Decision Trees , Drosophila Proteins/genetics , Drosophila melanogaster/genetics , Evolution, Molecular , Genome, Insect , Genome, Plant , Software
19.
Science ; 355(6327): 820-826, 2017 02 24.
Article in English | MEDLINE | ID: mdl-28219971

ABSTRACT

It is still not possible to predict whether a given molecule will have a perceived odor or what olfactory percept it will produce. We therefore organized the crowd-sourced DREAM Olfaction Prediction Challenge. Using a large olfactory psychophysical data set, teams developed machine-learning algorithms to predict sensory attributes of molecules based on their chemoinformatic features. The resulting models accurately predicted odor intensity and pleasantness and also successfully predicted 8 of the 19 rated semantic descriptors ("garlic," "fish," "sweet," "fruit," "burnt," "spices," "flower," and "sour"). Regularized linear models performed nearly as well as random forest-based ones, with a predictive accuracy that closely approaches a key theoretical limit. These models help to predict the perceptual qualities of virtually any molecule with high accuracy and also reverse-engineer the smell of a molecule.
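The finding that regularized linear models were competitive can be illustrated with the simplest possible case: a closed-form one-feature ridge regression, where the penalty shrinks the learned weight toward zero. This sketch is not the challenge's actual model; the data and penalty value are made up.

```python
# Illustrative one-feature ridge regression (no intercept): the penalty
# lam shrinks the weight, trading a little bias for lower variance.
# Data and lambda are hypothetical, not from the DREAM challenge.

def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for y ~ w * x with penalty lam."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + lam
    return num / den

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

w_ols = ridge_1d(xs, ys, 0.0)    # unregularized: recovers the true slope
w_ridge = ridge_1d(xs, ys, 1.0)  # regularized: shrunk toward zero
```

With thousands of chemoinformatic features and comparatively few rated molecules, this kind of shrinkage is what lets a linear model approach the performance of a random forest.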


Subject(s)
Odorants , Olfactory Perception , Smell , Adult , Datasets as Topic , Humans , Male , Models, Biological
20.
Cytometry A ; 89(1): 16-21, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26447924

ABSTRACT

The Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) challenges were established to compare the performance of computational methods for identifying cell populations in multidimensional flow cytometry data. Here we report the results of FlowCAP-IV, in which algorithms from seven different research groups predicted the time to progression to AIDS among a cohort of 384 HIV+ subjects, using antigen-stimulated peripheral blood mononuclear cell (PBMC) samples analyzed with a 14-color staining panel. Two approaches (FlowReMi.1 and flowDensity-flowType-RchyOptimyx) provided statistically significant predictive value in the blinded test set. Manual validation of submitted results indicated that unbiased analysis of single-cell phenotypes could reveal unexpected cell types that correlated with outcomes of interest in high-dimensional flow cytometry datasets.


Subject(s)
Acquired Immunodeficiency Syndrome/pathology , Benchmarking , Computational Biology/methods , Disease Progression , Flow Cytometry/methods , T-Lymphocytes/cytology , Acquired Immunodeficiency Syndrome/diagnosis , Algorithms , Data Interpretation, Statistical , HIV Seropositivity , Humans , Staining and Labeling